On the Importance of Visual Context for Data Augmentation in Scene Understanding
Data augmentation is known to be important for training deep visual recognition systems. By artificially increasing the number of training examples, it helps reduce overfitting and improves
generalization. While simple image transformations can already improve
predictive performance in most vision tasks, larger gains can be obtained by
leveraging task-specific prior knowledge. In this work, we consider object detection and semantic and instance segmentation, and we augment the training images by blending objects into existing scenes using instance segmentation annotations. We observe that randomly pasting objects onto images hurts performance unless the object is placed in the right context. To resolve this
issue, we propose an explicit context model by using a convolutional neural
network, which predicts whether an image region is suitable for placing a given
object or not. In our experiments, we show that our approach is able to improve
object detection, semantic and instance segmentation on the PASCAL VOC12 and
COCO datasets, with significant gains in a limited annotation scenario, i.e.
when only one category is annotated. We also show that the method is not
limited to datasets that come with expensive pixel-wise instance annotations
and can be used when only bounding boxes are available, by employing
weakly-supervised learning for instance mask approximation.
Comment: Updated the experimental section. arXiv admin note: substantial text overlap with arXiv:1807.0742
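To make the idea concrete, here is a minimal PyTorch sketch of a context model of this kind: a small CNN scores how suitable an image region is for placing each object category. The architecture, input size, and category count are illustrative assumptions, not the paper's actual model.

```python
# Minimal sketch of a placement-context classifier (illustrative, not the paper's model).
import torch
import torch.nn as nn

class ContextModel(nn.Module):
    def __init__(self, num_categories=20):
        super().__init__()
        # Feature extractor over the candidate region and its surrounding context.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # One placement-suitability logit per object category.
        self.classifier = nn.Linear(128, num_categories)

    def forward(self, region):
        # region: (B, 3, H, W) crop around a candidate placement location.
        x = self.features(region).flatten(1)
        return self.classifier(x)  # (B, num_categories) placement logits

model = ContextModel()
scores = torch.sigmoid(model(torch.randn(4, 3, 128, 128)))  # per-category suitability in [0, 1]
```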
BlitzNet: A Real-Time Deep Network for Scene Understanding
Real-time scene understanding has become crucial in many applications such as
autonomous driving. In this paper, we propose a deep architecture, called
BlitzNet, that jointly performs object detection and semantic segmentation in
one forward pass, allowing real-time computations. Besides the computational
gain of having a single network to perform several tasks, we show that object
detection and semantic segmentation benefit from each other in terms of
accuracy. Experimental results on the VOC and COCO datasets show state-of-the-art performance for object detection and segmentation among real-time systems.
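As a rough illustration of the joint formulation, the sketch below (PyTorch) runs a shared backbone once and branches into a detection head and a segmentation head in a single forward pass. Layer sizes, the anchor count, and head designs are placeholders rather than the published BlitzNet architecture.

```python
# Illustrative joint detection + segmentation network with a shared backbone.
import torch
import torch.nn as nn

class JointDetSegNet(nn.Module):
    def __init__(self, num_classes=21, num_anchors=6):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Detection head: per-location class scores and box offsets for each anchor.
        self.det_head = nn.Conv2d(128, num_anchors * (num_classes + 4), 3, padding=1)
        # Segmentation head: per-pixel class scores, upsampled to input resolution.
        self.seg_head = nn.Sequential(
            nn.Conv2d(128, num_classes, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )

    def forward(self, images):
        feats = self.backbone(images)        # shared features, computed once
        return self.det_head(feats), self.seg_head(feats)

det_out, seg_out = JointDetSegNet()(torch.randn(2, 3, 256, 256))
```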
SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models
Understanding dynamics from visual observations is a challenging problem that
requires disentangling individual objects from the scene and learning their
interactions. While recent object-centric models can successfully decompose a
scene into objects, modeling their dynamics effectively still remains a
challenge. We address this problem by introducing SlotFormer -- a
Transformer-based autoregressive model operating on learned object-centric
representations. Given a video clip, our approach reasons over object features
to model spatio-temporal relationships and predicts accurate future object
states. In this paper, we successfully apply SlotFormer to perform video
prediction on datasets with complex object interactions. Moreover, SlotFormer's unsupervised dynamics model can be used to improve performance on supervised downstream tasks such as Visual Question Answering (VQA) and goal-conditioned planning. Compared to past work on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics while retaining high-quality visual generation. In addition, SlotFormer enables VQA
models to reason about the future without object-level labels, even
outperforming counterparts that use ground-truth annotations. Finally, we show
its ability to serve as a world model for model-based planning, which is
competitive with methods designed specifically for such tasks.
Comment: Accepted by ICLR 2023. Project page: https://slotformer.github.io
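The following PyTorch sketch illustrates the general idea of an autoregressive Transformer over object slots: given slots from a few past frames, it predicts the slots of the next frame and can be rolled out over time. Dimensions, the positional encoding, and the readout scheme are assumptions for illustration, not the published SlotFormer model.

```python
# Illustrative autoregressive Transformer over object-centric slot representations.
import torch
import torch.nn as nn

class SlotDynamics(nn.Module):
    def __init__(self, slot_dim=64, num_slots=5, history=6, nhead=4, num_layers=2):
        super().__init__()
        self.num_slots = num_slots
        # Learned position embedding over the (time x slot) token grid.
        self.pos = nn.Parameter(torch.zeros(1, history * num_slots, slot_dim))
        layer = nn.TransformerEncoderLayer(d_model=slot_dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(slot_dim, slot_dim)

    def forward(self, slots):
        # slots: (B, T, K, D) object-centric representations from past frames.
        B, T, K, D = slots.shape
        tokens = slots.reshape(B, T * K, D) + self.pos[:, : T * K]
        out = self.encoder(tokens)
        # Read the next frame's slots off the last K token outputs (one per slot).
        return self.head(out[:, -K:])  # (B, K, D) predicted slots for frame T+1

model = SlotDynamics()
past = torch.randn(2, 6, 5, 64)
next_slots = model(past)
# Autoregressive rollout: append the prediction and drop the oldest frame.
rollout = torch.cat([past[:, 1:], next_slots.unsqueeze(1)], dim=1)
```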
Self-Supervised Learning of Action Affordances as Interaction Modes
When humans perform a task with an articulated object, they interact with the
object only in a handful of ways, while the space of all possible interactions
is nearly endless. This is because humans have prior knowledge about what
interactions are likely to be successful, e.g., to open a new door we first try
the handle. While learning such priors without supervision is easy for humans,
it is notoriously hard for machines. In this work, we tackle unsupervised
learning of priors of useful interactions with articulated objects, which we
call interaction modes. In contrast to the prior art, we use no supervision or
privileged information; we only assume access to the depth sensor in the
simulator to learn the interaction modes. More precisely, we define a successful interaction as one that substantially changes the visual environment, and we learn a generative model of such interactions that can be conditioned on the desired goal state of the object. In our experiments, we show that our
model covers most of the human interaction modes, outperforms existing
state-of-the-art methods for affordance learning, and can generalize to objects
never seen during training. Additionally, we show promising results in the goal-conditioned setup, where our model can be quickly fine-tuned to perform a given task. Our experiments show that the learned affordances cover most interaction modes of the queried articulated object and that the model can be fine-tuned into a goal-conditioned one. Supplementary material: https://actaim.github.io
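As a purely illustrative sketch of a goal-conditioned generative model of interactions, the snippet below samples low-dimensional interaction parameters (e.g., a contact point and a motion direction) conditioned on embeddings of the current depth observation and of the desired goal state. It is a generic conditional decoder with assumed names and sizes, not the paper's architecture.

```python
# Illustrative goal-conditioned sampler of interaction parameters (assumed design).
import torch
import torch.nn as nn

class InteractionSampler(nn.Module):
    def __init__(self, obs_dim=256, goal_dim=256, latent_dim=16, action_dim=6):
        super().__init__()
        self.latent_dim = latent_dim
        self.decoder = nn.Sequential(
            nn.Linear(obs_dim + goal_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),  # e.g., 3-D contact point + 3-D motion direction
        )

    def sample(self, obs_emb, goal_emb, num_samples=8):
        # obs_emb, goal_emb: (B, obs_dim), (B, goal_dim) encodings of depth observations.
        B = obs_emb.shape[0]
        z = torch.randn(B, num_samples, self.latent_dim)          # latent interaction modes
        cond = torch.cat([obs_emb, goal_emb], dim=-1).unsqueeze(1).expand(-1, num_samples, -1)
        return self.decoder(torch.cat([cond, z], dim=-1))          # (B, num_samples, action_dim)

proposals = InteractionSampler().sample(torch.randn(2, 256), torch.randn(2, 256))
```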
StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos
Instructional videos are an important resource to learn procedural tasks from
human demonstrations. However, the instruction steps in such videos are
typically short and sparse, with most of the video being irrelevant to the
procedure. This motivates the need to temporally localize the instruction steps
in such videos, i.e., the task of key-step localization. Traditional methods
for key-step localization require video-level human annotations and thus do not
scale to large datasets. In this work, we tackle the problem with no human
supervision and introduce StepFormer, a self-supervised model that discovers
and localizes instruction steps in a video. StepFormer is a transformer decoder
that attends to the video with learnable queries, and produces a sequence of
slots capturing the key-steps in the video. We train our system on a large
dataset of instructional videos, using their automatically-generated subtitles
as the only source of supervision. In particular, we supervise our system with
a sequence of text narrations using an order-aware loss function that filters
out irrelevant phrases. We show that our model outperforms all previous
unsupervised and weakly-supervised approaches on step detection and
localization by a large margin on three challenging benchmarks. Moreover, our model demonstrates an emergent ability to perform zero-shot multi-step localization and outperforms all relevant baselines on this task.
Comment: CVPR'23
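A minimal PyTorch sketch of the core mechanism is given below: a transformer decoder whose learnable queries cross-attend to per-frame video features and return an ordered set of step slots. The feature dimension and the number of queries are illustrative assumptions.

```python
# Illustrative transformer decoder with learnable queries over video features.
import torch
import torch.nn as nn

class StepDecoder(nn.Module):
    def __init__(self, feat_dim=512, num_queries=32, nhead=8, num_layers=3):
        super().__init__()
        # Learnable queries: each one can bind to one key-step in the video.
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, video_feats):
        # video_feats: (B, T, D) per-frame features of the input video.
        B = video_feats.shape[0]
        queries = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Queries cross-attend to the video and return one slot per query.
        return self.decoder(queries, video_feats)  # (B, num_queries, D) step slots

slots = StepDecoder()(torch.randn(2, 200, 512))
```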
LabelFormer: Object Trajectory Refinement for Offboard Perception from LiDAR Point Clouds
A major bottleneck to scaling up the training of self-driving perception systems is the human annotation required for supervision. A promising alternative is
to leverage "auto-labelling" offboard perception models that are trained to
automatically generate annotations from raw LiDAR point clouds at a fraction of
the cost. Auto-labels are most commonly generated via a two-stage approach --
first objects are detected and tracked over time, and then each object
trajectory is passed to a learned refinement model to improve accuracy. Since
existing refinement models are overly complex and lack advanced temporal
reasoning capabilities, in this work we propose LabelFormer, a simple,
efficient, and effective trajectory-level refinement approach. Our approach
first encodes each frame's observations separately, then exploits
self-attention to reason about the trajectory with full temporal context, and
finally decodes the refined object size and per-frame poses. Evaluation on both
urban and highway datasets demonstrates that LabelFormer outperforms existing
works by a large margin. Finally, we show that training on a dataset augmented
with auto-labels generated by our method leads to improved downstream detection
performance compared to existing methods. Please visit the project website for details: https://waabi.ai/labelformer
Comment: 20 pages, 8 figures, 7 tables
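The sketch below illustrates the trajectory-level refinement recipe described above: encode each frame's observation independently, apply self-attention across the whole track for full temporal context, then decode one refined object size plus per-frame pose corrections. The input/output parameterizations are simplified assumptions, not the actual LabelFormer design.

```python
# Illustrative trajectory refiner: per-frame encoding, temporal self-attention, decoding.
import torch
import torch.nn as nn

class TrajectoryRefiner(nn.Module):
    def __init__(self, obs_dim=7, hidden=128, nhead=4, num_layers=2):
        super().__init__()
        # Per-frame encoder (e.g., over an initial box: x, y, z, l, w, h, yaw).
        self.frame_enc = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=nhead, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.pose_head = nn.Linear(hidden, 4)   # per-frame (dx, dy, dz, dyaw) correction
        self.size_head = nn.Linear(hidden, 3)   # one refined (l, w, h) per trajectory

    def forward(self, traj):
        # traj: (B, T, obs_dim) initial per-frame box estimates for one object track.
        h = self.temporal(self.frame_enc(traj))   # self-attention over the full track
        pose_delta = self.pose_head(h)            # (B, T, 4)
        size = self.size_head(h.mean(dim=1))      # (B, 3), pooled over the track
        return pose_delta, size

pose_delta, size = TrajectoryRefiner()(torch.randn(2, 50, 7))
```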
Diversity with Cooperation: Ensemble Methods for Few-Shot Classification
Few-shot classification consists of learning a predictive model that is able to effectively adapt to a new class, given only a few annotated samples. To solve this challenging problem, meta-learning has become a popular paradigm that advocates the ability to "learn to adapt". Recent works have shown, however, that simple learning strategies without meta-learning can be competitive. In this paper, we go a step further and show that by addressing the fundamental high-variance issue of few-shot learning classifiers, it is possible to significantly outperform current meta-learning techniques. Our approach consists of designing an ensemble of deep networks to leverage the variance of the classifiers, and introducing new strategies to encourage the networks to cooperate while encouraging prediction diversity. Evaluation is conducted on the mini-ImageNet, tiered-ImageNet and CUB datasets, where we show that even a single network obtained by distillation yields state-of-the-art results.
Comment: Added experiments with different network architectures and input image resolutions.
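As a rough sketch of the ensemble idea, the snippet below trains several classifiers jointly with the usual cross-entropy plus a relationship term on their output distributions; a positive weight pulls the members toward the ensemble mean (cooperation), a negative one pushes them apart (diversity). The formulation and names are assumptions, not the paper's exact losses.

```python
# Illustrative ensemble objective with a cooperation/diversity term (assumed formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F

def ensemble_loss(logits_list, targets, gamma=0.1):
    # logits_list: list of (B, C) logits, one entry per ensemble member.
    ce = sum(F.cross_entropy(l, targets) for l in logits_list) / len(logits_list)
    # Mean prediction of the ensemble, treated as a soft target.
    mean_prob = torch.stack([F.softmax(l, dim=-1) for l in logits_list]).mean(0)
    # KL of each member to the ensemble mean; gamma > 0 encourages cooperation,
    # gamma < 0 encourages diversity.
    rel = sum(F.kl_div(F.log_softmax(l, dim=-1), mean_prob, reduction="batchmean")
              for l in logits_list) / len(logits_list)
    return ce + gamma * rel

members = [nn.Linear(64, 5) for _ in range(3)]        # three tiny classifiers on shared features
feats, targets = torch.randn(8, 64), torch.randint(0, 5, (8,))
loss = ensemble_loss([m(feats) for m in members], targets)
loss.backward()
```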